Project Aim

The aim of this project is to construct a phylogenetic tree of viruses in the order Mononegavirales based on differentiation in their complete genomic sequences obtained from Genbank. The tree will visually iterate the evolutionary relationships between viruses in this large order, including those viruses in several important families whose members are known to be pathogenic to humans. Some such families under the order Mononegavirales are Filoviridae (contains Ebola virus), Rhabdoviridae (includes Rabies virus), and Paromixoviridae (contains Measles and Mumps viruses).

Order Mononegavirales

If you didn’t think english could get any harder… there’s Latin

And if you still don’t understand how to say this word (how could you not?!) here’s something to confuse you a little bit more!

Background Information

Mononegavirales are an order of single stranded negative sense RNA viruses comprising 11 virus families of which many members are capable of causing significant human disease (or disease of vertebrate animalia). These viruses consist of non segmented genomes ranging from 8-20KB in length (8,00 - 20,000 base pairs or about 3,00-7,00 amino acids). Most viruses of the order Mononegavirales replicate in host-cell cytoplasm (with the exception of certain Bornaviruses which replicate in the nuclei), and once abundant in the host body can attack the immune, nervous, excretory, and other major organ systems causing a vast range of symptoms with varied severity. It is important to note that not all Mononegavirus infections are symptomatic, or severe. Some examples of Mononegavirales capable of causing infection in humans are Nipah virus, Ebola, Rabies virus, Measles virus and Mumps virus.

Project Importance

Understanding the evolutionary relationships of viruses especially those causing the pathogenesis of high mortality and morbidity diseases is exceedingly important to the preservation of public health, especially in underdeveloped countries where bio surveillance programs and methods are not as strong or advanced as those in other nations. Tracking species divergence as a result of viromic evolution including genomic mutation over time can help scientists and public health officials to create vaccines, predict local, province, and national epidemics or even pandemics, determine host ranges, and develop effective antivirals and treatments for existing infectious strains as well as potentially related or derived strains. Phylogenetic trees quantitatively represent these evolutionary relationships of specific viral taxa based on genome differentiation as a hypothesis of species divergence over time. These trees explain vast amounts of genomic data in an easy to understand format for scientists and the everyday person alike.

Project Methods

To communicate my final project I will detail how I went about constructing the phylogenetic tree which I built based on the complete genomic sequences of 220 virus isolates in the order Mononegavirales obtained from Genbank

Step 1: Acquire and Load R Packages To Import and Manipulate Genomic Data

Below I load up all R packages, some very unfamiliar to me before the initiation of this project, into my work space

#CRAN Packages
library(tidyverse)
library(dplyr)
library(readr)
library(kableExtra)
library(ape)
library(seqinr)
library(rentrez)
library(devtools)
library(stringr)
library(rentrez)
library(phangorn)

#Bioconductor packages 
library(msa)
library(Biostrings)
library(ggtree)
library(DECIPHER)
library(biomaRt)
## Warning: replacing previous import 'utils::findMatches' by
## 'S4Vectors::findMatches' when loading 'AnnotationDbi'
#github packages
library(compbio4all)
library(ggmsa)

Step 2: Obtaining genomes for viral sequences

Complete genomic sequences for the viruses included in my phylogenetic tree were found using NCBI. The viruses selected were of all complete genomes available for every family in the order Mononegavirales except for Rhabdoviridae. Some genomes in the order Rhabdoviridae were excluded from in my dataset and analysis based on sheer amount of genomes and lack of computing power. Instead of including all complete genomes for Rhaboviruses in my project I used the ‘filter by host’ feature in NCBI and selected those Rhabdoviridae with vertebrate hosts. Lists of Genbank accession numbers were downloaded from NCBI as a seperate file for each family.

My final dataset thus consisted of 70 complete genomes of viruses from the family Rhabdoviridae and 150 complete genomes of viruses in other families within the order Mononegavirales.

Here is the code I used to import the accession lists (saved locally on my computer) into R for further analysis.

filo <- readr::read_lines("./Data/Filoviridae.acc_lst")
art <- readr::read_lines("./Data/Artoviridae.acc_lst")
born <- readr::read_lines("./Data/Bornaviridae.acc_lst")
lisp <- readr::read_lines("./Data/Lispiviridae.acc_lst")
mym <- readr::read_lines("./Data/Mymonaviridae.acc_lst")
nyan <- readr::read_lines("./Data/Nyamiviridae.acc_lst")
para <- readr::read_lines("./Data/Paramyxoviridae.acc_lst")
pneu <- readr::read_lines("./Data/Pneumoviridae .acc_lst")
sun <- readr::read_lines("./Data/Sunshinevirus.acc_lst")
xin <- readr::read_lines("./Data/Xinmoviridae.acc_lst")
rhab <- readr::read_lines("./Data/rhabdovirus.acc_lst")
virus_accession_vector <- c(filo,art,born,lisp,mym,nyan,para,pneu,sun,xin,rhab)

This generates one long vector of accession numbers. Here’s a glimpse:

glimpse(virus_accession_vector)
##  chr [1:220] "NC_016144" "NC_055510" "NC_039345" "NC_014373" "NC_004161" ...

Step 3: Create a FASTA File of All Complete Genome Sequences

The next step for me was naturally to create a FASTA file of all complete genome sequences that one can download and view. This is a way to combine all genomic data into a single document (a specialized list in R) that can be re-imported later for further analysis.

I accomplished this process with the rentrez package using the function entrez_fetch_list(). This function takes in Genbank accession numbers, connects to an NCBI database db (in my case the NCBI nucleotide database) and generates a list of complete genome sequences in FASTA format. The FASTA can then be saved with the function write.fasta() The file is saved onto my computer as virus_genomes.FASTA.

If you want to view the FASTA for yourself here is a link to download: complete list of virus genomes (2.9MB)

Instead of copying and pasting or typing out each individual accession number to fulfill the names argument for the function which would be excruciating, I simply inserted my previously made vector of accession numbers, which is saved as the object virus_accession_vector.

Here is the code to pull complete genome sequences from NCBI by accession numbers and save as a FASTA:

virus_fasta_list <- entrez_fetch_list(db = "nucleotide", 
id =virus_list, 
rettype = "fasta")
write.fasta(sequences = virus_fasta_list, names = names(virus_accession_vector), 
            file.out = "./Data/virus_genomes.FASTA")

Step 4: Create a Data Table

The next thing I wanted to do was create a data table where the accession number for a virus, species, and respective family could be viewed at once. This was surprisingly tricky. After much frustration, and many unnescessary forloops I was able to accomplish this.

Anyway, here is the (slightly embarassing) code I wrote to acquire the data for my table:

#Assign family names to each taxa IN ORDER of how accession numbers were concatenated (I physically counted how many of each...and double checked)
s1 <- rep("Filoviridae", 12)
s2 <- rep("Artoviridae",7 )
s3 <- rep("Bornaviridae",18)
s4 <- rep("Lispiviridae", 6)
s5 <- rep("Mymonaviridae", 4)
s6 <- rep("Nyamiviridae", 13)
s7 <- rep("Paramyxoviridae",72)
s8 <- rep("Pneumoviridae", 9)
s9 <- rep("Sunshinevirus", 1)
s10 <- rep("Xinmoviridae",8)
s11 <- rep("Rhabdoviridae", 70)

#combine into a vector (same length as virus_accession_vector)
fam <- c(s1,s2,s3,s4,s5,s6,s7,s8,s9,s10,s11)
#create data frame with accession number and family. 
df <- base::data.frame(accession_number=virus_accession_vector, virus_family=fam)

#WE AIN"T DONE YET

#creating vector of organism names by splitting up the FASTA I just created and extracting from there with a billion forloops. 

#To tell you the truth I cannot even explain what my thought process was here... just trial error and tears I think. But in the end, it worked!
list_split <- list()
for (i in 1:length(virus_fasta_list)){list_split[[i]] <- str_split(virus_fasta_list[[i]],pattern = ',')
}
virus_names <- list()
for (i in 1:length(list_split)) {virus_names[[i]] <- purrr::map(list_split[[i]],1)

}
virus_species <- list()
virus_1 <- str_split(virus_names, "\\\\")
for (i in 1:length(virus_1)) {virus_species[[i]] <- purrr::map(virus_1[[i]][[1]],1)
}
virus_species <- unlist(virus_species)
virus_species <- str_remove(virus_species, "^...................")
virus_species <- str_remove(virus_species, "..$")

#overwriting my original data frame df, with a new one, where virus species is a column. 
df <- data.frame(df, virus_species=virus_species)

My Data Table

The data table of many tears!

## New names:
## Rows: 220 Columns: 4
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (3): accession_number, virus_family, virus_species dbl (1): ...1
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
…1 accession_number virus_family virus_species
1 NC_016144 Filoviridae Lloviu cuevavirus isolate Lloviu virus/M.schreibersii-wt/ESP/2003/Asturias-Bat86
2 NC_055510 Filoviridae Mengla dianlovirus isolate Rousettus-wt/CHN/2015/Sharen-Bat9447-1
3 NC_039345 Filoviridae Bombali ebolavirus isolate Bombali ebolavirus/Mops condylurus/SLE/2016/PREDICT_SLAB000156
4 NC_014373 Filoviridae Bundibugyo ebolavirus
5 NC_004161 Filoviridae Reston ebolavirus isolate Reston virus/M.fascicularis-tc/USA/1989/Philippines89-Pennsylvania
6 NC_006432 Filoviridae Sudan ebolavirus isolate Sudan virus/H.sapiens-tc/UGA/2000/Gulu-808892
7 NC_014372 Filoviridae Tai Forest ebolavirus isolate Tai Forest virus/H.sapiens-tc/CIV/1994/Pauleoula-CI
8 NC_002549 Filoviridae Zaire ebolavirus isolate Ebola virus/H.sapiens-tc/COD/1976/Yambuku-Mayinga
9 NC_001608 Filoviridae Marburg marburgvirus isolate Marburg virus/H.sapiens-tc/KEN/1980/Mt. Elgon-Musoke
10 NC_024781 Filoviridae Marburg marburgvirus isolate Ravn virus/H.sapiens-tc/KEN/1987/Kitum Cave-810040
11 NC_055175 Filoviridae Wenling frogfish filovirus strain XYHYS28627 nucleoprotein
12 NC_055176 Filoviridae Wenling thamnaconus septentrionalis filovirus strain LQMMTII17328 hypothetical protein
13 NC_032430 Artoviridae Beihai barnacle virus 8 strain BHTH14652 hypothetical protein 1
14 NC_032555 Artoviridae Beihai rhabdo-like virus 1 strain BHTSS15727 hypothetical protein 1
15 NC_032558 Artoviridae Beihai rhabdo-like virus 2 strain BHTSS7258 hypothetical protein 1
16 NC_055115 Artoviridae Hubei rhabdo-like virus 5 strain WHSFII19440 RNA-dependent RNA polymerase gene
17 NC_055114 Artoviridae Hubei rhabdo-like virus 6 strain QTM24798 hypothetical protein and RNA-dependent RNA polymerase genes
18 NC_055113 Artoviridae Hubei rhabdo-like virus 8 strain QTM19395 RNA-dependent RNA polymerase gene
19 NC_038269 Artoviridae Pteromalus puparum negative-strand RNA virus 1
20 NC_039013 Bornaviridae Jungle carpet python virus
21 NC_039014 Bornaviridae Southwest carpet python virus
22 NC_055169 Bornaviridae Wuhan sharpbelly bornavirus strain DSYS4497 nucleoprotein gene
23 NC_029642 Bornaviridae Aquatic bird bornavirus 1
24 NC_030691 Bornaviridae Aquatic bird bornavirus 2
25 NC_001607 Bornaviridae Borna disease virus 1
26 NC_030692 Bornaviridae Borna disease virus 2
27 NC_030690 Bornaviridae Canary bornavirus 1
28 NC_027892 Bornaviridae Canary bornavirus 2
29 NC_024296 Bornaviridae Canary bornavirus 3
30 NC_038268 Bornaviridae Estrildid finch bornavirus 1 isolate VS-4707 nucleoprotein (N)
31 NC_024778 Bornaviridae Loveridges garter snake virus 1 strain 251327
32 NC_039189 Bornaviridae Parrot bornavirus 1 strain M25 N protein (N)
33 NC_028106 Bornaviridae Parrot bornavirus 2
34 NC_030688 Bornaviridae Parrot bornavirus 4
35 NC_030689 Bornaviridae Parrot bornavirus 7
36 NC_039190 Bornaviridae Parrot bornavirus 5 strain 2014-A
37 NC_030701 Bornaviridae Variegated squirrel bornavirus 1
38 NC_031259 Lispiviridae Lishi Spider Virus 2 strain LSZZ-4 ORF1 (ORF1)
39 NC_032944 Lispiviridae Hubei odonate virus 10 strain QTM133243 hypothetical protein 1
40 NC_031270 Lispiviridae Tacheng Tick Virus 6 strain TCRP-2 ORF1 (ORF1)
41 NC_032929 Lispiviridae Hubei rhabdo-like virus 3 strain QCM135517 hypothetical protein 1
42 NC_033436 Lispiviridae Wuchan romanomermis nematode virus 2 strain WCLSXC55347 hypothetical protein 1
43 NC_031088 Lispiviridae Sanxia Water Strider Virus 4 strain SXSSP12 ORF1 (ORF1)
44 NC_043483 Mymonaviridae Sclerotinia sclerotiorum negative-stranded RNA virus 4 isolate 257 gp5
45 NC_032783 Mymonaviridae Hubei rhabdo-like virus 4 strain arthropodmix13990 hypothetical protein 1
46 NC_025383 Mymonaviridae Sclerotinia sclerotiorum negative-stranded RNA virus 1 strain AH98
47 NC_026732 Mymonaviridae Sclerotinia sclerotiorum negative-stranded RNA virus 3 isolate IL1 gp6
48 NC_043486 Nyamiviridae Beihai rhabdo-like virus 3 strain BHNXC39077 putative nucleoprotein
49 NC_032543 Nyamiviridae Beihai rhabdo-like virus 4 strain BHJP58499 putative nucleoprotein
50 NC_032544 Nyamiviridae Beihai rhabdo-like virus 5 strain BHJP63888 putative nucleoprotein
51 NC_032541 Nyamiviridae Beihai rhabdo-like virus 6 strain BHJJX49420 putative nucleoprotein
52 NC_032793 Nyamiviridae Wenling crustacean virus 12 strain WLJQ47777 putative nucleoprotein
53 NC_031275 Nyamiviridae Wenzhou Crab Virus 1 strain RBX2 nucleocapsid (N)
54 NC_012702 Nyamiviridae Midway virus
55 NC_012703 Nyamiviridae Nyamanini virus
56 NC_024376 Nyamiviridae Sierra Nevada virus
57 NC_043485 Nyamiviridae Orinoco virus strain UW1
58 NC_024702 Nyamiviridae Soybean cyst nematode midway virus N-protein
59 NC_033450 Nyamiviridae Wenzhou tapeworm virus 1 strain SGJSC14943 RNA-dependent RNA polymerase gene
60 NC_033451 Nyamiviridae Wenzhou tapeworm virus 1 strain SGJSC14943 hypothetical protein 1
61 NC_039017 Paramyxoviridae Antarctic penguin virus A
62 NC_039018 Paramyxoviridae Antarctic penguin virus B
63 NC_039019 Paramyxoviridae Antarctic penguin virus C
64 NC_025407 Paramyxoviridae Avian paramyxovirus 11 isolate common_snipe/France/100212/2010
65 NC_039230 Paramyxoviridae Avian paramyxovirus 2 strain APMV-2/Chicken/California/Yucaipa/56
66 NC_040796 Paramyxoviridae Avian paramyxovirus 20 isolate APMV-20/gull/Kazakhstan/5976/2014
67 NC_025361 Paramyxoviridae Avian paramyxovirus 5 strain budgerigar/Kunitachi/74
68 NC_039194 Paramyxoviridae Avian paramyxovirus 6 strain APMV-6/Goose/FarEast/4440/2003
69 NC_025347 Paramyxoviridae Avian paramyxovirus 7 strain APMV-7/dove/Tennessee/4/75
70 NC_039195 Paramyxoviridae Avian paramyxovirus 8 strain goose/Delaware/1053/76
71 NC_039223 Paramyxoviridae Newcastle disease virus isolate JSD0812
72 NC_025363 Paramyxoviridae Avian paramyxovirus 12 isolate Wigeon/Italy/3920_1/2005
73 NC_025390 Paramyxoviridae Avian paramyxovirus 9 strain duck/New York/22/1978
74 NC_025373 Paramyxoviridae Avian paramyxovirus 3 strain turkey/Wisconsin/68
75 NC_039015 Paramyxoviridae Avian paramyxovirus 14 isolate APMV14/duck/Japan/11OG0352/2011
76 NC_034968 Paramyxoviridae Avian paramyxovirus 15 isolate APMV-15/calidris_fuscicollis/Brazil/RS-1177/2012
77 NC_019531 Paramyxoviridae Avian paramyxovirus 4 strain APMV-4/duck/Delaware/549227/2010
78 NC_039016 Paramyxoviridae Avian paramyxovirus UPO216 isolate APMV-15/WB/Kr/UPO216/2014
79 NC_030231 Paramyxoviridae Avian paramyxovirus goose/Shimane/67/2000 viral cRNA
80 NC_025349 Paramyxoviridae Avian paramyxovirus penguin/Falkland Islands/324/2007
81 NC_005036 Paramyxoviridae Goose paramyxovirus SF02
82 NC_007803 Paramyxoviridae Beilong virus
83 NC_002161 Paramyxoviridae Bovine parainfluenza virus 3
84 NC_001921 Paramyxoviridae Canine distemper virus
85 NC_028362 Paramyxoviridae Caprine parainfluenza virus 3 strain JS2013
86 NC_025351 Paramyxoviridae Cedar virus isolate CG1a
87 NC_005283 Paramyxoviridae Dolphin morbillivirus
88 NC_039196 Paramyxoviridae Feline morbillivirus strain 761U
89 NC_005084 Paramyxoviridae Fer-de-lance virus
90 NC_025256 Paramyxoviridae Bat Paramyxovirus Eid_hel/GH-M74a/GHA/2009
91 NC_001906 Paramyxoviridae Hendra virus
92 NC_003461 Paramyxoviridae Human parainfluenza virus 1
93 NC_001796 Paramyxoviridae Human parainfluenza virus 3
94 NC_038270 Paramyxoviridae Simian Agent 10
95 NC_007454 Paramyxoviridae J-virus
96 NC_001498 Paramyxoviridae Measles virus
97 NC_025352 Paramyxoviridae Mojiang virus isolate Tongguan1
98 NC_005339 Paramyxoviridae Mossman virus
99 NC_043539 Paramyxoviridae Mount Mabu Lophuromys virus 1 isolate MOZ135_1
100 NC_043540 Paramyxoviridae Mount Mabu Lophuromys virus 2 isolate MOZ135_2
101 NC_001552 Paramyxoviridae Sendai virus genomic RNA
102 NC_017937 Paramyxoviridae Nariva virus
103 NC_002728 Paramyxoviridae Nipah virus
104 NC_006383 Paramyxoviridae Peste des petits ruminants virus complete geno
105 NC_028249 Paramyxoviridae Phocine distemper virus strain PDV/Wadden_Sea.NLD/1988
106 NC_055168 Paramyxoviridae Pohorje myodes paramyxovirus 1 isolate TT02/05
107 NC_025402 Paramyxoviridae Porcine parainfluenza virus 1 strain S206N
108 NC_006296 Paramyxoviridae Rinderpest virus (strain Kabete O)
109 NC_025386 Paramyxoviridae Salem virus
110 NC_025360 Paramyxoviridae Atlantic salmon paramyxovirus isolate ASPV/Yrkje371/95
111 NC_025355 Paramyxoviridae Tailam virus strain TL8K
112 NC_002199 Paramyxoviridae Tupaia paramyxovirus
113 NC_055167 Paramyxoviridae Bank vole virus strain RP-12
114 NC_025403 Paramyxoviridae Achimota virus 1
115 NC_025404 Paramyxoviridae Achimota virus 2
116 NC_055508 Paramyxoviridae Alston virus isolate Alstonville
117 NC_038271 Paramyxoviridae Bat Paramyxovirus Epo_spe/AR1/DRC/2009
118 NC_055459 Paramyxoviridae UNVERIFIED: Hervey virus isolate BO6 genomic sequen
119 NC_003443 Paramyxoviridae Human rubulavirus 2
120 NC_021928 Paramyxoviridae Human parainfluenza virus 4a viral cRNA
121 NC_009489 Paramyxoviridae Mapuera virus
122 NC_039197 Paramyxoviridae Menangle virus isolate Australia/bat/2009/Cedar Grove
123 NC_002200 Paramyxoviridae Mumps orthorubulavirus genomic RNA
124 NC_006430 Paramyxoviridae Parainfluenza virus 5 strain W
125 NC_009640 Paramyxoviridae Porcine rubulavirus
126 NC_006428 Paramyxoviridae Simian virus 41
127 NC_025343 Paramyxoviridae Sosuga virus isolate 2012
128 NC_039198 Paramyxoviridae Teviot virus isolate Cedar Grove
129 NC_004074 Paramyxoviridae Tioman virus
130 NC_025410 Paramyxoviridae Tuhoko virus 1
131 NC_025348 Paramyxoviridae Tuhoko virus 2
132 NC_025350 Paramyxoviridae Tuhoko virus 3
133 NC_039231 Pneumoviridae Avian pneumovirus strain LAH A
134 NC_039199 Pneumoviridae Human metapneumovirus isolate 00-1
135 NC_001989 Pneumoviridae Bovine orthopneumovirus
136 NC_038272 Pneumoviridae Bovine respiratory syncytial virus ATCC51908
137 NC_001803 Pneumoviridae Respiratory syncytial virus
138 NC_038235 Pneumoviridae Human orthopneumovirus Subgroup A
139 NC_001781 Pneumoviridae Human orthopneumovirus Subgroup B
140 NC_006579 Pneumoviridae Pneumonia virus of mice J3666
141 NC_025344 Pneumoviridae Pneumovirus dog/Bari/100-12/ITA/2012
142 NC_025345 Sunshinevirus Sunshine virus
143 NC_033055 Xinmoviridae Hubei diptera virus 11 strain SCM51501 hypothetical protein 1
144 NC_055111 Xinmoviridae Shuangao Fly Virus 2 strain QSA05 RNA-dependent RNA polymerase (L) gene
145 NC_031244 Xinmoviridae Xincheng Mosquito Virus strain XC1-6 ORF1 (ORF1)
146 NC_035133 Xinmoviridae Culex mononega-like virus 2 strain mos191gb23464
147 NC_033027 Xinmoviridae Hubei rhabdo-like virus 7 strain QTM26925 hypothetical protein
148 NC_043484 Xinmoviridae Drosophila unispina virus 1 ORF1
149 NC_055112 Xinmoviridae Bolahun virus variant 2 putative nucleoprotein
150 NC_032849 Xinmoviridae Hubei orthoptera virus 5 strain ZCM94262 hypothetical protein 1
151 NC_028246 Rhabdoviridae Adelaide River virus isolate DPP61
152 NC_025391 Rhabdoviridae Almpiwar virus isolate MRM4059
153 NC_022755 Rhabdoviridae American bat vesiculovirus TFFN-2013 isolate liver2008
154 NC_034535 Rhabdoviridae Barur virus nucleoprotein
155 NC_025358 Rhabdoviridae Berrimah virus strain DPP 63
156 NC_002526 Rhabdoviridae Bovine ephemeral fever virus
157 NC_034550 Rhabdoviridae Chaco virus nucleoprotein
158 NC_020805 Rhabdoviridae Chandipura virus isolate CIN 0451
159 NC_025397 Rhabdoviridae Coastal Plains virus strain DPP53
160 NC_028255 Rhabdoviridae Cocal virus Indiana 2
161 NC_055509 Rhabdoviridae Cuiaba virus strain BeAn 227841
162 NC_025406 Rhabdoviridae Dolphin rhabdovirus isolate pxV1 1992
163 NC_038284 Rhabdoviridae Durham virus nucleocapsid
164 NC_022581 Rhabdoviridae Eel Virus European X complete genome
165 NC_009527 Rhabdoviridae European bat lyssavirus 1
166 NC_009528 Rhabdoviridae European bat lyssavirus 2 isolate RV1333
167 NC_025253 Rhabdoviridae Farmington virus strain CT 114
168 NC_025341 Rhabdoviridae Fikirini bat rhabdovirus isolate KEN352
169 NC_028867 Rhabdoviridae Fox fecal rhabdovirus isolate S40 nucleocapsid protein
170 NC_031988 Rhabdoviridae Gannoruwa bat lyssavirus isolate RV3266
171 NC_055530 Rhabdoviridae Garba virus nucleoprotein
172 NC_025376 Rhabdoviridae Grass carp rhabdovirus V76
173 NC_039202 Rhabdoviridae Flanders virus nucleoprotein
174 NC_018629 Rhabdoviridae Ikoma lyssavirus
175 NC_001652 Rhabdoviridae Infectious hematopoietic necrosis virus
176 NC_020806 Rhabdoviridae Isfahan virus N gene
177 NC_034451 Rhabdoviridae Kern Canyon virus nucleoprotein
178 NC_034540 Rhabdoviridae Keuraliba virus nucleoprotein
179 NC_025396 Rhabdoviridae Kimberley virus isolate CS368
180 NC_034549 Rhabdoviridae Klamath virus nucleoprotein
181 NC_028239 Rhabdoviridae Koolpinyah virus isolate DPP819
182 NC_028236 Rhabdoviridae Kumasi rhabdbovirus
183 NC_038277 Rhabdoviridae Trout rhabdovirus 903/87 nucleocapsid protein
184 NC_034533 Rhabdoviridae Landjia virus nucleoprotein
185 NC_038276 Rhabdoviridae Nishimuro virus viral cRNA for hypothetical proteins
186 NC_031955 Rhabdoviridae Lleida bat lyssavirus isolate RV3208
187 NC_020808 Rhabdoviridae Aravan virus
188 NC_003243 Rhabdoviridae Australian bat lyssavirus
189 NC_025251 Rhabdoviridae Bokeloh bat lyssavirus isolate 21961
190 NC_025377 Rhabdoviridae West Caucasian bat virus
191 NC_020810 Rhabdoviridae Duvenhage virus isolate 86132SA
192 NC_020809 Rhabdoviridae Irkut virus
193 NC_025385 Rhabdoviridae Khujand lyssavirus
194 NC_020807 Rhabdoviridae Lagos bat virus isolate 0406SEN
195 NC_006429 Rhabdoviridae Mokola virus
196 NC_001542 Rhabdoviridae Rabies virus
197 NC_025365 Rhabdoviridae Shimoni bat virus
198 NC_034530 Rhabdoviridae Marco virus nucleoprotein
199 NC_034545 Rhabdoviridae Mount Elgon bat virus nucleoprotein
200 NC_005093 Rhabdoviridae Hirame rhabdovirus
201 NC_034548 Rhabdoviridae Oita virus nucleoprotein
202 NC_020803 Rhabdoviridae Perch perhabdovirus isolate PRV nucleoprotein (N) gene
203 NC_038286 Rhabdoviridae Piry virus strain BeAn2423
204 NC_025387 Rhabdoviridae Scophthalmus maximus rhabdovirus
205 NC_034529 Rhabdoviridae Sena Madureira virus nucleoprotein
206 NC_008514 Rhabdoviridae Siniperca chuatsi rhabdovirus
207 NC_000903 Rhabdoviridae Snakehead rhabdovirus complete geno
208 NC_002803 Rhabdoviridae Spring viraemia of carp virus
209 NC_025356 Rhabdoviridae Pike fry rhabdovirus isolate F4
210 NC_025401 Rhabdoviridae Sunguru virus isolate Ug#41
211 NC_055474 Rhabdoviridae Taiwan bat lyssavirus isolate TWBLV/TN/2016
212 NC_025371 Rhabdoviridae Tench rhabdovirus S64
213 NC_020804 Rhabdoviridae Tibrogargan virus strain CS132
214 NC_007020 Rhabdoviridae Tupaia virus
215 NC_043538 Rhabdoviridae Vaprio virus nucleoprotein
216 NC_025353 Rhabdoviridae Vesicular stomatitis Alagoas virus Indiana 3
217 NC_001560 Rhabdoviridae Vesicular stomatitis Indiana virus
218 NC_038236 Rhabdoviridae Vesicular stomatitis Indiana virus strain 98COE
219 NC_024473 Rhabdoviridae Vesicular stomatitis New Jersey virus isolate NJ1184HDB
220 NC_000855 Rhabdoviridae Viral hemorrhagic septicemia virus Fil3

Step 5: Sequence Alignment

Making Sequences Useable

The first thing I need to do to get a good alignment, is clean my sequences into a useable format. Luckily for me, this is a fairly straightforward process and there are several good functions to help like fasta_cleaner from the package campbio4all. In esssence I need to make all my sequences into a single character vector of length 220. This means each individual component of the vector needs to be a single character string of a complete genome.

To accomplish this first I will use the function fasta cleaner on each element in virus fasta list:

genome_list <- list()
for(i in 1:length(virus_fasta_list)){
  genome_list[[i]] <- fasta_cleaner(virus_fasta_list[[i]], parse = F)
}

Then, I need to create an empty vector of length 220. This will literally be a vector of 220 NA’s. Dont worry. I will fill it up later with sequences!

Note: genome list is the same length as the number of genome sequences in my FASTA file, which is 220. So saying virus_vector = rep(NA, 220) and virus_vector = rep(NA, length(genome_list)) accomplishes the same thing.

virus_vector <- rep(NA, length(genome_list))

Next, I replace each NA in my vector with a complete sequence using a forloop

for(i in 1:length(virus_vector)){
  virus_vector[i] <- genome_list[[i]]}

Last, I add the names of the virus species (that I extracted earlier) to the virus vector:

names(virus_vector) <- virus_species

Heres a glimpse of that long character vector:

glimpse(virus_vector)
##  chr [1:220] "CACACAAAAAGAAAAAAAGTTTTTTTAACCTTTTTGTGTCCTTAATACCTTGAAGAATATTAATCTCTTCCCGGAAATCGAGATCGAGATCTGGCTTCAAGTAACACTAAT"| __truncated__ ...

Making a stringset using BioStrings

Converting my newly made vector into a Biostrings DNA stringset. It is important to note that Mononegavirales genomes are RNA so any sequence extracted is going to be the compliment of the actual genome. This doesn’t affect any part of in phylogenetic tree construction as it is based on nucleotide differentiation between genomes and because their nucleotides are all complimented the differences would be same. The original sequences pulled from Genbank were actually in DNA format (using A, C, T, G), and so they too are the compliment of the viral genomes.

Here is the code to construct the stringset:

virus_ss <- Biostrings::DNAStringSet(virus_vector)

Here is the stringset:

virus_ss
## DNAStringSet object of length 220:
##       width seq                                             names               
##   [1] 18927 CACACAAAAAGAAAAAAAGTTT...GAATTAAGAAAAAACTAGTTGT Lloviu cuevavirus...
##   [2] 18300 GAGGAATATTAATTTTTATAAG...GTGAAAGAATATTAAGAAAAAA Mengla dianloviru...
##   [3] 19043 CGGACACACAAAAAGAAAGAAG...TTTTTTCCTTTTTGTGTGTCCT Bombali ebolaviru...
##   [4] 18940 CGGACACACAAAAAGAATGAAG...TTTTCTCTTTTTTGTGTGTCCA Bundibugyo ebolav...
##   [5] 18891 CGGACACACAAAAAGAAAAAAG...ATTTTTTCCTTTTTGTGTGTCC Reston ebolavirus...
##   ...   ... ...
## [216] 11070 ACGAAGACAAACAAACCATTTA...TAGCTGGTTTTGTTGTTTCCGT Vesicular stomati...
## [217] 11161 ACGAAGACAAACAAACCATTAT...TATCTGGTTTTGTGGTCTTCGT Vesicular stomati...
## [218] 11162 ACGAAGACAAACAAACCATTAT...TATCTGGTTTTGTGGTCTTCGT Vesicular stomati...
## [219] 11123 ACGAAGACAAAAAAACCATTAT...GAATTGGTTTTTTTGTCTTCGT Vesicular stomati...
## [220] 11158 GTATCATAAAAGATGATGAGTT...TGGTATGTATTATTTTCTATAC Viral hemorrhagic...

Aligning sequences…FINALLY!!!

How genomes align relative to each other give the information needed to construct a phylogenetic tree. Differences in the alignment from one genome to another reflect evolutionary differentiation. This evolutionary differentiation allocates to the ‘distance’ between two genomes. A phylogeny is an informative diagram of virus’s evolutionary relationships or distances to and from each other, less genomically ‘distant’ viruses are placed closer together on the tree while more distant viruses are placed farther apart on the tree. Positions in the tree are assigned by best fit models that can interpret genomic alignments.

Virus alignment was performed using the alignseqs() function from the DECIPHER package.

Here is the code for the alignment:

virus_align <- DECIPHER::AlignSeqs(virus_ss)

Here is a peek of the alignment:

virus_align
## DNAStringSet object of length 220:
##       width seq                                             names               
##   [1] 47301 ----------------------...---------------------- Lloviu cuevavirus...
##   [2] 47301 ---------GAGGAATATTAAT...---------------------- Mengla dianloviru...
##   [3] 47301 -------------------CGG...---------------------- Bombali ebolaviru...
##   [4] 47301 -------------------CGG...---------------------- Bundibugyo ebolav...
##   [5] 47301 -------------------CGG...---------------------- Reston ebolavirus...
##   ...   ... ...
## [216] 47301 ----------------------...---------------------- Vesicular stomati...
## [217] 47301 ----------------------...---------------------- Vesicular stomati...
## [218] 47301 ----------------------...---------------------- Vesicular stomati...
## [219] 47301 ----------------------...---------------------- Vesicular stomati...
## [220] 47301 -------------------GTA...---------------------- Viral hemorrhagic...

By default the alignseqs() function creates a DNAstringSet object. Although I can use this object to create basic phylogenetic trees with distance based methods, I will need to convert to a DNAMultipleAlignment object moving forward to create more complex and representative trees using model optimization and other methods.

This is an easy conversion. I will keep virus_align as a DNAStringSet and create a DNAMultipleAlignment called `Multiple_virus_align, so I can still utilize both.

multiple_virus_align <- DNAMultipleAlignment(virus_align)

Heres a look at the multiple alignment:

multiple_virus_align
## DNAMultipleAlignment with 220 rows and 47301 columns
##        aln                                                  names               
##   [1] -----------------------CA...------------------------ Lloviu cuevavirus...
##   [2] ---------GAGGAATATTAATTTT...------------------------ Mengla dianloviru...
##   [3] -------------------CGGACA...------------------------ Bombali ebolaviru...
##   [4] -------------------CGGACA...------------------------ Bundibugyo ebolav...
##   [5] -------------------CGGACA...------------------------ Reston ebolavirus...
##   [6] -------------------CGGACA...------------------------ Sudan ebolavirus ...
##   [7] -------------------CGGACA...------------------------ Tai Forest ebolav...
##   [8] -------------------CGGACA...------------------------ Zaire ebolavirus ...
##   [9] ---------------AGACACACAA...------------------------ Marburg marburgvi...
##   ... ...
## [212] -------------------------...------------------------ Tench rhabdovirus...
## [213] -------------------------...------------------------ Tibrogargan virus...
## [214] -------------------------...------------------------ Tupaia virus
## [215] -------------------------...------------------------ Vaprio virus nucl...
## [216] -------------------------...------------------------ Vesicular stomati...
## [217] -------------------------...------------------------ Vesicular stomati...
## [218] -------------------------...------------------------ Vesicular stomati...
## [219] -------------------------...------------------------ Vesicular stomati...
## [220] -------------------GTATCA...------------------------ Viral hemorrhagic...

The function ggmsa gives a more visual representation of the genome alignment. Here I have selected a segment from nucleotide 2060 - 2080, a region of much differentiation between viruses. This goes to show how different the genomes of viruses, even within the same families, and orders can be at strategic locations. The colored boxes reflect where nucleotides overlap between sequences and species.

ggmsa(multiple_virus_align, start = 2060, end = 2080)

Step 6: Constructing Phylogenetic Trees

At last my friends, we begin to make phylogenetic trees!

Constructing a Phylogenetic Tree Using Distance Based Methods

To begin with we will construct a basic phylogenetic tree using neighbor join method NJ. Neighbor join method takes in a large stringset that has been aligned such as virus align that I created and uses a distance matrix to create the tree.

To start my Neighbor Join tree I first need to create a distance matrix. I will do this with the DECIPHER package using the `Distance matrix function.

Here is the code:

virus_dist <- DECIPHER::DistanceMatrix(virus_align, type='matrix')

Make it into a distance matrix:

virus_dist <- as.dist(virus_dist)

Here’s a glimpse at the values:

virus_dist[1:10]
##  [1] 0.6980494 0.7022852 0.7014216 0.7044919 0.7026391 0.7032094 0.7017479
##  [8] 0.7008485 0.6968324 0.7466097

round the values in the matrix:

virus_dist_rounded <- round(virus_dist, digits=3)

A glimpse at the rounded values:

virus_dist_rounded[1:10]
##  [1] 0.698 0.702 0.701 0.704 0.703 0.703 0.702 0.701 0.697 0.747

In hindsight the actual values don’t matter that much, its the tree that does the explaining!

To create the tree we use the function nj() from the package ape and insert the non rounded distances virus_dist

treeNJ <- nj(virus_dist)

Here’s a zoomed out view of the resulting tree. The point is to be able to see the whole thing all at once! I know it’s huge!

Below you can scroll over the tree to zoom in on organism names

Neighbor joining is not the most accurate or representative method of creating a phylogenetic tree however. To construct my best phylogenetic tree I will use a different method than a basic neighbor join.

Building a Phylogenetic Tree From an Optimized Model

To build the best phylogenetic tree possible I will use an optimized model. By selecting a best fit model for my data to base my tree upon I will be able to build the most accurate phylogenetic tree possible.

To begin with I will import cleaned aligned sequences (from earlier) that I previously saved into a FASTA file with the function write.fasta. For the purposed of this assignment I will not show the code to write the FASTA as I feel it is unnecessary, but I will show importing the FASTA. I will save the imported sequences as the object viruses for the purpose of identifying that the aligned sequences are in FASTA format.

viruses <- read.phyDat(file = "./Data/virus_align.FASTA",
                        format = "fasta")

From here I can construct several different trees like a upgma based on distance methods dm or a tree using pratchet. However these would still be relatively simple trees. And I want a better one

One way to measure how ‘good’ your phylogenetic tree is is with a parsimony score. The lower the parsimony score, the more representative or your total data set the tree is. The parsimony score is a measure of how many deletions or changes had to be made to your data (in my case 220 genome sequences of 8,000 - 20, 000 nucleotides each) in order to construct your tree.

to calculate the parsimony score for my previous tree treeNJ I can simply run the function parsimony from the phangorn package compared to my sequence data viruses

parsimony(treeNJ, viruses)
## [1] 1031966

As you can see, over 1 million changes had to be made to my data to create this tree. In other words this tree SUCKS lets see if we can make it suck less :)

First, lets see what the parsimony scores would be for upgma pratchet and even random additon methods, just for curiosity.

Building upgma random addition and pratchet trees using phanghorn.

dm  <- dist.ml(viruses)
#building the same distance matrix I made before, just in a different way. 
treeUPGMA  <- upgma(dm)
treeRA <- random.addition(viruses)
treeRatchet  <- pratchet(viruses, trace = 0, minit=100)

And finding their parsimony scores:

parsimony(c(treeUPGMA,treeRA, treeRatchet), viruses)
## [1] 942929 939453 937443

These are down to under 1 million. Which might be about the best we can do… it looks like treeRAtchet is the best so far. Let’s see if we can do any better using a maximum likelihood approach by selecting a best fit model.

To build a better tree the modeltest mt function from the Phangorn package.This is extremely computationally heavy:

mt <- modelTest(viruses)

This gives a list of possible models, and their performance values (like AIC). Here’s the first 10 rows of the resulting data frame. (it is over 92 rows long)

mt[1:10,]
##         Model  df   logLik     AIC AICw    AICc AICcw     BIC
## 1          JC 437 -3130390 6261654    0 6261662     0 6265484
## 2        JC+I 438 -3130390 6261656    0 6261664     0 6265495
## 3     JC+G(4) 438 -3049505 6099886    0 6099894     0 6103724
## 4   JC+G(4)+I 439 -3049505 6099888    0 6099896     0 6103735
## 5         F81 440 -3125273 6251425    0 6251434     0 6255282
## 6       F81+I 441 -3125273 6251427    0 6251436     0 6255292
## 7    F81+G(4) 441 -3040042 6080966    0 6080974     0 6084831
## 8  F81+G(4)+I 442 -3040042 6080968    0 6080976     0 6084841
## 9         K80 438 -3119213 6239303    0 6239311     0 6243141
## 10      K80+I 439 -3119213 6239305    0 6239313     0 6243152

This next function takes over 24 hours to run. It selects at optimized model from mt to create a the most representative tree possible from my data.

fit_mt <- pml_bb(mt, control = pml.control(trace = 0))

Here is the output of fit_mt. It shows the model selected and relative rates of each nucleotide which is pretty informative!

fit_mt
## model: GTR+G(4) 
## loglikelihood: -3017233 
## unconstrained loglikelihood: -419643.9 
## Discrete gamma model
## Number of rate categories: 4 
## Shape parameter: 1.499813 
## 
## Rate matrix:
##          a        c        g        t
## a 0.000000 2.174429 2.787524 1.404949
## c 2.174429 0.000000 1.518994 3.257664
## g 2.787524 1.518994 0.000000 1.000000
## t 1.404949 3.257664 1.000000 0.000000
## 
## Base frequencies:  
##         a         c         g         t 
## 0.3046548 0.2077264 0.2137143 0.2739046

I can access the tree generated by calling:

fit_mt$tree
## 
## Phylogenetic tree with 220 tips and 218 internal nodes.
## 
## Tip labels:
##   Lloviu cuevavirus isolate Lloviu virus/M.schreibersii-wt/ESP/2003/Asturias-Bat86, Pteromalus puparum negative-strand RNA virus 1, Mengla dianlovirus isolate Rousettus-wt/CHN/2015/Sharen-Bat9447-1, Hubei rhabdo-like virus 5 strain WHSFII19440 RNA-dependent RNA polymerase gene, Kern Canyon virus nucleoprotein, Bovine orthopneumovirus, ...
## Node labels:
##   1, 1, 1, 1, 1, 0.994, ...
## 
## Unrooted; includes branch lengths.

Now to compare the parsimony scores:

parsimony(c(treeRA, treeNJ, treeRatchet, fit_mt$tree), viruses)
## [1]  939453 1031966  937443  940039

Just based on parsimony score, it appears that treeRA and treeRatchet may outperform fit_mt$tree. But that is not necessarily the case. Parsimony score is only based on changes in the data. Not necessarily how representative of the data the tree actually is. Because fit_mt$tree was generated from the best fitting model of over 95 simulated models, I have a gut feeling it is going to be the best tree, because it most likely is using the best model. 95 models were tested against the data using the mt and pml_bb method and only one model was used with the other methods, so I choose the fit_mt$tree. Basically parsimony score gives us an idea of how well the tree represents the data but it isn’t everything, its just one measure and several of the methods have very close parsimony scores nonetheless. Based on just parsimony score however treeRatchet is the front runner.

Another thing that gives fit_mt$tree an advantage is that ultrafast bootstrapping is also conducted by default. Bootstrapping is yet another way to test for model validity. Bootstrap values in essence tell how many times out of 100 the same branch in the same position would be observed with re sampled data, this is expressed as a percentage in decimal form. A boot strap value above .95 is best.

Lets have a look at fit_mt$tree:

This phylogenetic tree shows the evolutionary relationship of viruses in the order Mononegavirales. Nodes (spots where 2 branches/lines diverge) represent that last common ancestor of that species. Viruses that share a more recent common ancestor, are more closely related. Viruses that have a common ancestor farther back (toward the left) on the tree are less related. Thus time should be thought of as passing from the left to the right on this tree. Organisms farther to the right are more recently diverged, and organisms toward the left are less recently diverged.

Below you can scroll over the tree to zoom in on organism names

Let’s explore this phylogeny in a few other formats. You can build a phylogenetic tree which can be navigated with the package Phylocanvas.

Here is a circular tree:

There’s also a diagonal format (this one is less readable:

And radial format (even less readable):

So What’s The Big Deal?

Now is the time that we get to admire the significance, and explanatory power that phylogenetic trees provide.

To accomplish this, lets look back at how the different families of viruses fit within my phylogeny. I am going to use the rectangular format because it is easier to read and track.

The ape package includes the function zoom which allows us to pinpoint specific subsets of organisms within a tree. Let’s hone in on each of our families and see where they fit in the evolutionary history of Mononegavirales.

To do this I am going to create a subset from my original data table of viruses that are in each family. I am then going to ‘zoom’ in on these viruses in my optimized model generated tree fit_mt$tree. Here is the code for a zoom on the family Filoviridae:

As you can see the Filoviruses make a clade toward the center of the tree. This makes sense seeing as the Filoviruses comprise a family and so they are going to be more related to eachother than viruses of other families.

This is a sign that my phylogenetic tree is strong, and a good fit for my data set.

Let’s do it for the other families.

Artoviridae:

The Artoviridae are clustered together as well in a small clade toward the top of the tree. This means the are very closely related (evolutionarily speaking)

Bornaviridae:

The Bornaviruses again show clustering in a clade. It is very clear that all Bornaviruses descend from one single node. Thus all Bornaviruses share one single and distinct common ancestor that virusese from no other family in the order Mononegavirales share.

Lispiviridae:

{r. out.width=1000, out.height=600, echo=FALSE} born <- df %>% filter(virus_family=="Bornaviridae") born <- born$virus_species ape::zoom(fit_mt$tree, born, no.margin=T, cex=.5)

Lispiviruses have one basal taxon (one organism descending by itself from a node; a single branch descending from a node) and a clade descending from one single last common ancestor. The basal taxon may be perhaps less derived, or the most primitive of the lispiviruses, while the clade represents more and more divergence as one looks on the tree from left to right.

Mymonaviridae:

Mymonaviruses are quite interesting. 2 distinct clades descend from 2 distinct common ancestors. This shows that viruses in the clade Lispiviridae despite being in the same familya are actually quite diverse. Evolutionarily speaking.

Nyamiviridae:

Nyamiviruses are quite diverse as well. There are two main clades descending from a single node, but it seems that viruses in this family are numerous and have become more and more derived over time. It would be scientifically intersting to study viruses from these two different clades. Species and family demarcation criteria is set by the ICTV (International committee on the taxonomy of viruses), it would be especially intriguing to compare these classifications to the actual characteristic, and phenotypic differences of these viruses, seeing as some of them are so evolutionarily distant from each other.

Paramyxoviridae:

Paramyxoviruses comprise the center portion of the tree. It seems there are two main clades, with several subclades branching off from each. Thus, each paramyxovirus is part of 1 of two main lineages of virus.

Pneumoviridae:

Pneumoviruses again form one main cluster or clade of taxa. Thus all of these viruses are very closely related, and there is not as much diversity within the family as with other families (again evolutionarily speaking)

Sunshinevirus: There is only one sunshine virus in the data set, so the label will not be shown. The virus name is ironically “Sunshine Virus”. The red line is where it belongs in the tree.

It seems like the Sunshine virus is just hanging out in the middle of what will farther on be classified as the Rhabdoviruses. Obviously it must share some similarity genomically with the Rhabdoviruses and must be most closely related to the Rhabdoviruses based on this phylogeny and its placement. A phylogeny is again a visual, spacial, hypothesis.

Xinmoviridae:

Xinmoviruses have 3 main branches, this family however contains very few viruses, perhaps that is just because full genomes have not been sequenced yet, or the virus is not very prevalent. Either way, it seems each Xinmovirus belongs to one of 3 lineages.

Rhabdoviridae:

Last but not least are Rhabdoviruses. The representation of Rhabdoviruses in this tree does justice to what they truly are. Rhabdoviruses are a monstrous clade of very diverse viruses. There are many clades and subclades, and like in this picture, there are not really any main clades that truly stand out. Rhabdoviruses are kind of this big mess of extremely diverse, extremely vast viruses. Some Rhabdoviruses even exist in plant host and are vectored by arthropods (not included in this data set). Its interesting to see that even though Rhabdoviruses are such a giant in the Mononegavirales world they do remain clustered together, there is something that they share genomically that makes them Rhabdoviruses not just characteristically. Genomics are truly just another piece in the never ending puzzle of the human understanding of biology.

Conclusion

Phylogenies and their visual representation as phylogenetic trees are a tool of ever growing importance to the Biology, Ecology, Virology, Epidemiology, Entomology and their respective communities. These trees are an easy to understand, visual way of representing the evolutionary relationships between organisms, which has many important, useful and beneficial applications, especially to the preservation of public health. In a world of continual concern, understanding the phylogenetic histories and relationships of potentially dangerous viral agents is crucial to sustaining healthy communities around the globe.